Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition
We present a novel integration of an instruction-tuned large language model
(LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can
perform a wide range of linguistic tasks in a zero-shot manner when
provided with a precise instruction or prompt to guide the text generation
process towards the desired task. We explore using this zero-shot capability of
LLMs to extract linguistic information that can contribute to improving ASR
performance. Specifically, we direct an LLM to correct grammatical errors in an
ASR hypothesis and harness the embedded linguistic knowledge to conduct
end-to-end ASR. The proposed model is built on the hybrid connectionist
temporal classification (CTC) and attention architecture, where an
instruction-tuned LLM (i.e., Llama2) is employed as a front-end of the decoder.
An ASR hypothesis, subject to correction, is obtained from the encoder via CTC
decoding, which is then fed into the LLM along with an instruction. The decoder
subsequently takes as input the LLM embeddings to perform sequence generation,
incorporating acoustic information from the encoder output. Experimental
results and analyses demonstrate that the proposed integration yields promising
performance improvements, and our approach largely benefits from LLM-based
rescoring.
Comment: Submitted to ICASSP202
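As background for the CTC decoding step described above, the ASR hypothesis fed into the LLM can be obtained by standard CTC greedy decoding: merge consecutive repeated frame labels, then drop blank symbols. A minimal sketch with made-up label IDs (not the paper's actual implementation):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse a per-frame CTC label sequence into a hypothesis:
    merge consecutive repeats, then drop blank symbols."""
    hypothesis = []
    prev = None
    for token in frame_ids:
        if token != prev:          # merge repeated frame labels
            if token != blank:     # drop the blank symbol
                hypothesis.append(token)
        prev = token
    return hypothesis

# Per-frame argmax labels over 8 frames (0 is the blank symbol):
frames = [0, 3, 3, 0, 5, 5, 5, 2]
print(ctc_greedy_decode(frames))  # [3, 5, 2]
```

The resulting token sequence is what would then be embedded in the instruction prompt for the LLM.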
InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss
This paper presents InterMPL, a semi-supervised learning method of end-to-end
automatic speech recognition (ASR) that performs pseudo-labeling (PL) with
intermediate supervision. Momentum PL (MPL) trains a connectionist temporal
classification (CTC)-based model on unlabeled data by continuously generating
pseudo-labels on the fly and improving their quality. In contrast to
autoregressive formulations, such as the attention-based encoder-decoder and
transducer, CTC is well suited for MPL, or PL-based semi-supervised ASR in
general, owing to its simple/fast inference algorithm and robustness against
generating collapsed labels. However, CTC generally performs worse than
autoregressive models due to the conditional independence assumption,
thereby limiting the performance of MPL. We propose to enhance MPL by
introducing intermediate loss, inspired by the recent advances in CTC-based
modeling. Specifically, we focus on self-conditioned and hierarchical
conditional CTC, which apply auxiliary CTC losses to intermediate layers such
that the conditional independence assumption is explicitly relaxed. We also
explore how pseudo-labels should be generated and used as supervision for
intermediate losses. Experimental results in different semi-supervised settings
demonstrate that the proposed approach outperforms MPL, improving ASR
performance by up to 12.1% absolute. In addition, our detailed
analysis validates the importance of the intermediate loss.
Comment: Submitted to ICASSP202
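The "momentum" in MPL refers to a teacher model, used to generate pseudo-labels, whose weights are an exponential moving average (EMA) of the student's weights. A minimal sketch of that update on toy weight lists (the alpha value is illustrative):

```python
def momentum_update(teacher, student, alpha=0.999):
    """Exponential-moving-average update used in momentum pseudo-labeling:
    the teacher (pseudo-label generator) slowly tracks the student weights."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]

teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = momentum_update(teacher, student, alpha=0.9)
print(teacher)  # roughly [0.9, 0.1]
```

With alpha close to 1, the teacher changes slowly, which stabilizes the quality of the pseudo-labels it generates on the fly.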
BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder
We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech
recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced
encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR
has been actively studied, aiming to utilize versatile linguistic knowledge for
generating accurate text. One crucial factor that makes this integration
challenging lies in the vocabulary mismatch; the vocabulary constructed for a
pre-trained LM is generally too large for E2E-ASR training and is likely to
have a mismatch against a target ASR domain. To overcome such an issue, we
propose BECTRA, an extended version of our previous BERT-CTC, that realizes
BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based
model, which adopts BERT-CTC for its encoder and trains an ASR-specific decoder
using a vocabulary suitable for a target task. With the combination of the
transducer and BERT-CTC, we also propose a novel inference algorithm for taking
advantage of both autoregressive and non-autoregressive decoding. Experimental
results on several ASR tasks, varying in amounts of data, speaking styles, and
languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing
with the vocabulary mismatch while exploiting BERT knowledge.
Comment: Submitted to ICASSP202
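The vocabulary mismatch discussed above can be illustrated by converting BERT-style WordPiece output into a different target vocabulary. In this toy sketch (not the actual BECTRA procedure), characters stand in for an ASR-specific vocabulary:

```python
def retokenize(bert_tokens, joiner="##"):
    """Join WordPiece-style subword tokens back into text, then split the
    text into characters, standing in for an ASR-specific vocabulary."""
    text = ""
    for tok in bert_tokens:
        if tok.startswith(joiner):
            text += tok[len(joiner):]          # continuation subword
        else:
            text += (" " if text else "") + tok  # word-initial subword
    return list(text)

print("".join(retokenize(["speech", "recogni", "##tion"])))  # speech recognition
```

A decoder trained on the smaller, task-matched vocabulary avoids the large, domain-mismatched output layer that a pre-trained LM's vocabulary would impose.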
Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
Achieving high accuracy with low latency has always been a challenge in
streaming end-to-end automatic speech recognition (ASR) systems. By attending
to more future contexts, a streaming ASR model achieves higher accuracy but
results in larger latency, which hurts the streaming performance. In the
Mask-CTC framework, an encoder network is trained to learn the feature
representation that anticipates long-term contexts, which is desirable for
streaming ASR. Mask-CTC-based encoder pre-training has been shown to be
beneficial in achieving low latency and high accuracy for triggered attention-based ASR.
However, the effectiveness of this method has not been demonstrated for various
model architectures, nor has it been verified that the encoder has the expected
look-ahead capability to reduce latency. This study, therefore, examines the
effectiveness of Mask-CTC-based pre-training for models with different
architectures, such as Transformer-Transducer and contextual block streaming
ASR. We also discuss the effect of the proposed pre-training method on
obtaining accurate output spike timing.
Comment: Accepted to EUSIPCO 202
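For context, Mask-CTC-style training randomly masks ground-truth tokens and trains the decoder to recover them from the observed tokens and the encoder output. A toy sketch of the masking step only (the mask ratio and mask symbol are illustrative):

```python
import random

def mask_tokens(tokens, mask_token="<MASK>", ratio=0.4, seed=0):
    """Randomly replace a fraction of ground-truth tokens with a mask symbol;
    in Mask-CTC-style training the decoder learns to recover the masked ones."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in positions else t for i, t in enumerate(tokens)]

print(mask_tokens(["the", "cat", "sat", "on", "mat"]))
```

Predicting masked tokens forces the encoder to anticipate context beyond the current frame, which is the look-ahead capability the study examines.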
Conversation-oriented ASR with multi-look-ahead CBS architecture
During conversations, humans are capable of inferring the intention of the
speaker at any point of the speech to prepare the following action promptly.
Such ability is also the key for conversational systems to achieve rhythmic and
natural conversation. To perform this, the automatic speech recognition (ASR)
used for transcribing the speech in real-time must achieve high accuracy
without delay. In streaming ASR, high accuracy is assured by attending to
look-ahead frames, which leads to delay increments. To tackle this trade-off
issue, we propose a multiple-latency streaming ASR system that achieves high
accuracy with zero look-ahead. The proposed system contains two encoders that
operate in parallel: a primary encoder generates accurate outputs utilizing
look-ahead frames, while an auxiliary encoder recognizes the look-ahead portion
of the primary encoder without look-ahead. The proposed system is constructed
based on contextual block streaming (CBS) architecture, which leverages block
processing and has a high affinity for the multiple latency architecture.
Various methods for architecting the system are also studied, including
shifting the network to act as different encoders and generating
both encoders' outputs in a single encoding pass.
Comment: Submitted to ICASSP202
BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model
This paper presents BERT-CTC, a novel formulation of end-to-end speech
recognition that adapts BERT for connectionist temporal classification (CTC).
Our formulation relaxes the conditional independence assumptions used in
conventional CTC and incorporates linguistic knowledge through the explicit
output dependency obtained by BERT contextual embedding. BERT-CTC attends to
the full contexts of the input and hypothesized output sequences via the
self-attention mechanism. This mechanism encourages a model to learn
inner/inter-dependencies between the audio and token representations while
maintaining CTC's training efficiency. During inference, BERT-CTC combines a
mask-predict algorithm with CTC decoding, which iteratively refines an output
sequence. The experimental results reveal that BERT-CTC improves over
conventional approaches across variations in speaking styles and languages.
Finally, we show that the semantic representations in BERT-CTC are beneficial
towards downstream spoken language understanding tasks.
Comment: v1: Accepted to Findings of EMNLP 2022; v2: Minor corrections and
clearer derivation of Eq. (21
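The iterative refinement used in BERT-CTC's inference resembles the mask-predict schedule: at each iteration the least-confident positions are re-masked and re-predicted, with the number of masks shrinking over iterations. A sketch of the mask-selection step only (the confidence scores are made up):

```python
def select_masks(confidences, n_iter, total_iters):
    """Mask-predict style schedule: at refinement iteration n_iter, re-mask
    the k least-confident positions, with k shrinking linearly to zero."""
    k = int(len(confidences) * (total_iters - n_iter) / total_iters)
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(order[:k])  # positions to re-mask, in sequence order

# 6 tokens, iteration 1 of 3 -> re-mask the 4 least-confident positions
print(select_masks([0.9, 0.2, 0.8, 0.4, 0.95, 0.3], n_iter=1, total_iters=3))
# [1, 2, 3, 5]
```

Each re-masked position is then filled in again conditioned on the full (partially observed) sequence, which is how the output dependency is made explicit.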
TurkScanner: Predicting the Hourly Wage of Microtasks
Workers in crowd markets struggle to earn a living. One reason for this is
that it is difficult for workers to accurately gauge the hourly wages of
microtasks, and they consequently end up performing labor with little pay. In
general, workers are provided with little information about tasks, and are left
to rely on noisy signals, such as textual description of the task or rating of
the requester. This study explores various computational methods for predicting
the working times (and thus hourly wages) required for tasks based on data
collected from other workers completing crowd work. We provide the following
contributions. (i) A data collection method for gathering real-world training
data on crowd-work tasks and the times required for workers to complete them;
(ii) TurkScanner: a machine learning approach that predicts the necessary
working time to complete a task (and can thus implicitly provide the expected
hourly wage). We collected 9,155 data records using a web browser extension
installed by 84 Amazon Mechanical Turk workers, and explored the challenge of
accurately recording working times both automatically and by asking workers.
TurkScanner was created using ~150 derived features, and was able to predict
the hourly wages of 69.6% of all the tested microtasks within a 75% error.
Directions for future research include observing the effects of tools on
people's working practices, adapting this approach to a requester tool for
better price setting, and predicting other elements of work (e.g., the
acceptance likelihood and worker task preferences).
Comment: Proceedings of the 28th International Conference on World Wide Web
(WWW '19), San Francisco, CA, USA, May 13-17, 201
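To illustrate the prediction task (this is not TurkScanner's actual model), a toy nearest-neighbour estimate of working time from previously logged task records; the feature vectors and values below are hypothetical:

```python
def predict_working_time(task_features, history):
    """Toy nearest-neighbour predictor: estimate a task's working time (in
    seconds) from the most similar previously logged task record."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(history, key=lambda rec: dist(rec["features"], task_features))
    return nearest["seconds"]

history = [
    {"features": [120, 0.05], "seconds": 300},  # e.g. [description length, reward]
    {"features": [40, 0.01], "seconds": 60},
]
print(predict_working_time([110, 0.04], history))  # 300
```

The implied hourly wage then follows from the task reward: a $0.05 task predicted at 300 seconds pays 0.05 / (300 / 3600) = $0.60 per hour.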